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Abstract 

This paper proposes a generalization of the conjugate gradient (CG) method used to solve the 
equation Ax = b for a symmetric positive definite matrix A of large size n. The generalization 
consists of permitting the scalar control parameters (= stepsizes in gradient and conjugate gradient 
directions) to be replaced by matrices, so that multiple descent and conjugate directions are updated 
simultaneously. Implementation involves the use of multiple agents or threads and is referred to 
as cooperative CG (cCG), in which the cooperation between agents resides in the fact that the 
calculation of each entry of the control parameter matrix now involves information that comes from 
the other agents. For a sufficiently large dimension n, the use of an optimal number of cores gives the 
result that the multithread implementation has worst case complexity 0(n^^^'''^) in exact arithmetic. 
Numerical experiments, that illustrate the interest of theoretical results, are carried out on a multicore 
computer. 

1 Introduction 

The paradigm of cooperation between agents in order to achieve some common objective has now become 
quite common in many areas such as control and distributed computation, while repesenting a multitude 
of different situations and related mathematical questions [7J [TTJ [TU] . 

In the field of computation, the emphasis has been mainly on the paradigm of parallel computing in 
which some computational task is subdivided into as many subtasks as there are available processors. 
The subdivision naturally induces a communication structure (or graph), connecting processors and 
the challenge is to achieve a subdivision that maximizes concurrency of tasks (hence minimizing total 
computational time) , while simultaneously minimizing communication overhead. This paradigm arose as 
a consequence of the usual architecture of most early multiprocessor machines, in which interprocessor 
communication is a much slower operation than a mathematical operation carried out in the same 
processor. Disadvantages of this approach arise from the difficulty of effectively decomposing a large 
task into minimally connected subtasks, difficulties of analysis and the need for synchronization barriers 
at which all processors wait for the slowest one, in order to exchange information with the correct time 
stamps (i.e., without asymmetric delays). 

More recently, in the area of control, interest has been focused on multiagcnt systems, in which a 
number of agents cooperate amongst themselves, in a distributed manner and also subject to a commu- 
nication graph that describes possible or allowable channels between agents, in order to achieve some 
(computational) task. Similarly, in the area of computation, multicore processors have now become 
common ~ in these processors, each core acommodates a thread which is executed independently of the 
threads in the other cores. Thus, in the context of this paper, which is focused on solution of the linear 
system of equations Ax ~ b for a symmetric positive definite matrix A of large size n, we will assume 
that each agent carries out a task that is represented by one thread that executes on one core, so that, 
in this sense, the words agent and thread can be assumed to represent the same thing. In what follows. 
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unless we are specifically talking about numerical implementation, we will give preference to the word 
agent. 

With the advent of ever larger on-chip memory and multicore processors that allow multithread 
programming, it is now possible to propose a new paradigm in which each thread, with access to a 
common memory, computes its own estimate of the solution to the whole problem (i.e., decomposition 
of the problem into subproblcms is avoided) and the threads exchange information amongst themselves, 
this being the cooperative step. The design of a cooperative algorithm has the objective of ensuring that 
exchanged information is used by the threads in such a way as to reduce overall convergence time. 

The idea of information exchange between two iterative processes was introduced into numerical 
linear algebra long before the advent of multicore processors by Brezinski [3] under the name of hybrid 
procedures, defined as (we quote) "a combination of two arbitrary approximate solutions with coefficients 
summing up to one... (so that) the combination only depends on one parameter whose value is chosen 
in order to minimize the Euclidean norm of the residual vector obtained by the hybrid procedure... 
The two approximate solutions which are combined in a hybrid procedure are usually obtained by two 
iterative methods." The objective of minimizing the residue is to accelerate convergence of the overall 
hybrid procedure. This idea was generalized and discussed in the context of distributed asynchronous 
computation in [T]. 

More specifically, this paper explores the idea of cooperation between p agents (or threads) in the 
context of the conjugate gradient (CG) algorithm applied to an n-dimensional linear system Ax = b, for a 
symmetric positive definite matrix A of large size n. Throughout the paper it is assumed that p < n, and 
even that p <^n: the number of agents may be "large" , but it is usually "much smaller" than the "huge" 
size of matrix A. The famous CG algorithm, proposed in [5], has several interesting properties, both as an 
algorithm in exact arithmetic and as one in finite precision arithmetic [51 H] • However, it is well known 
that, due to its structure, it cannot be parallelized in the conventional sense. In this paper, we revisit 
the CG algorithm from a multithread perspective, which can be seen as a direct generalization of the 
control approach to the CG algorithm proposed in [2l pp. 77-82], in which the scalar control parameters 
(stepsizes in gradient and conjugate gradient directions) are replaced by matrices (i.e., multivariable 
control). The cooperation between agents resides in the fact that the calculation of each entry of the 
control matrix now involves information that comes from the other agents. The method can also be seen 
as a generalization of the traditional CG algorithm in which multiple descent and conjugate directions 
are updated simultaneously. 

The paper is organized as follows. Section [5] briefly recalls the construction, as well as the main 
convergence results, of Conjugate Gradient method. Section [3] then presents the new algorithm, called 
cooperative Conjugate Gradient (cCG ) method. In order to simplify this presentation of the new algorithm, 
the case of p = 2 agents is first introduced in Section 13.11 The general case p > 2 is then stated in full 
generality in Section 13.21 together with analysis results. Complexity issues are broached in Section |4l 
The results stated therein concerns execution of cCG algorithm in exact arithmetic. Section [5] is then 
devoted to numerical experiments with the multi-thread implementation. Section [5] provides conclusions 
and directions for future work. 

Notation For the fixed symmetric definite positive matrix A £ M"^", we define ^-norm in R" by 

= (xMa;)i/2, a; e M" (1) 

and define A- orthogonality (or conjugacy) of vectors by: 

X ±Ay^ x^Ay = 0, x, ?/ e M" . (2) 

We will also have to consider matrices whose columns are vectors of interest. Accordingly, we will say 
that X,Y E R"^P are orthogonal (resp. ^-orthogonal) whenever each column of X is orthogonal (resp. 
A-orthogonal) to each column of Y, that is when 

X^Y = (resp. X'^AY = 0) . (3) 

For any set of vectors g M", i = 0, 1, . . . , fc, we denote respectively {r^j^ and [r^Jg the set of these 
vectors, and the matrix obtained by their concatenation: [r^Jg = [ro ri ... r^] € ]R"><('^+i). The 
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notation span [ri]o will denote the subspace of linear combinations of the columns of the matrix [r^Jp. 
When M" is the ambient vector space, we thus have 

span [r4 = G M" : 37 G E^+\ v = 1^7.'^. = [n]ol^ ■ (4) 

Similarly, for matrices Ri e M"^p, i = 0, . . . , fc, the notation (respectively, [R,]^ € K"x(fc+i)p) 

is used for the set of these matrices (respectively, the matrix obtained as concatenation of the matri- 
ces i?o,i?i,...,i?fc, i.e., = [Ro Ri ... Rk] e R"><('=+i)P.) Also, we write span [R,]^ for the 
subspace of linear combinations of the columns of [Ri]o- 

span [R,]^ ^{ve R" : 37 e m(^'+i)p, « = [R,]^„j^ . (5) 

Notice that this notation generalizes the definition provided earlier for vectors, and that span [R] is 
already meaningful for a single matrix R G R"^p. As an example, dimspan[i?] = rank R. 
Last, for any matrix R G M."^p and for any set J of indices in {1, . . . we will denote 

-^Ij^J — {Rij)l<i<n,j£j- 



2 The Conjugate Gradient Method 

One approach to solving the equation 

Ax^b , (6) 

with A symmetric positive definite and of large dimension, is to minimize instead the convex quadratic 
function 

fix) = ^x'Ax ~ b^x , (7) 

since the unique optimal point is x* = A~^b. Several algorithms are based on the standard idea of 
generating a sequence of points, starting from an arbitrary initial guess, and proceeding in the descent 
direction (negative gradient of f{x)), with an adequate choice of the step size. In mathematical terms: 

Xk+i ^ Xk- akTk, Tk = V/(.Tfc) = Axk ~b , (8) 

where is the step size. The vector represents both the gradient of the cost function / at the current 
point Xk , and the current residue in the process of solving ^ . 

Amongst the possible choices for ak, a most natural one consists in minimizing the value of the 
function / at x^+i, that is in taking 

ak = argmin/(xfc - ark) (9) 

The algorithm obtained using this principle is the Steepest Descent Method^ and one shows easily that 
the optimal value is given by the Rayleigh quotient 

Algorithm ([Sll- dTUl) is convergent, but in general one cannot expect better convergence speed than the 
one provided by 

, X k 
K — 1 



\\xk-x*\\A< [^7^] \\^o-x*\\a (11) 



where k is the condition number 

. Ainax(^) 



(12) 



Amin(A) 

The main weakness of Steepest Descent is the fact that steps taken in the same directions as earlier 
steps are likely to occur. The Conjugate Direction Methods avoid this drawback. Based on a set of n 
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mutually conjugate nonzero vectors {dj}o ^ (that is di dj for any i 7^ j, i,j = 0, . . . , n — 1), the 
family of conjugate direction methods use the sequence generated according to 

Xk+i ^ Xk + akXk, ak = - rj A, ■ (13) 

d\Adk 

It is a classical result that the residue r^+i is orthogonal to the directions di, z = 0, . . . , fc; and that Xk+i 
indeed minimizes / on the affine subspace xq + span [di]^ |5]. As a consequence of this last property, 
conjugate direction methods lead to finite time convergence (in exact arithmetic). 

The Conjugate Gradient method^ developed by Hestenes and Stiefel [5], is the particular method of 
conjugate directions obtained when constructing the conjugate directions by Gram-Schmidt orthogonal- 
ization, achieved at step 1 on the set of the gradients {ri}^. A key point here is that this construction 
can be carried out iteratively. The iterative equations of the Conjugate Gradient method are given in the 
pseudocode instructions of Algorithm [1] Instructions [BHZl constitute the optimal descent process in the 
direction dk\ while instructions [51-fTUl achieve iteratively the orthogonalization of the subspaces span [ri]^. 



Algorithm 1 Conjugate Gradient (CG) algorithm 

1 : choose xo G M" 

2: Tq := Axq — b 

3: do := ro 

4: fc := 

5: while dk =/= do 

6: ak := -rldkidlAdk)'^ 
7: Xk+1 := Xk + akdk 
8: rfc+1 := Axk+i - b 
9: /3fe -rl^,Adk{dlAdkr^ 
10: dk+i := Tk+i + Pkdk 

11: fc fc + 1 

12: end while 



We recall the main properties of this algorithm, in an adapted form, to allow for easier comparison 
with the results to be stated later. 

Theorem 1 (Properties of CG). As long as the vector dk is not zero 

• the vectors {t^Iq are mutually orthogonal, the vectors {dijo o.f^ mutually A-orthogonal, and the 
subspaces span [r,;]Q,span [dil^ and span [A*ro]§ are equal and have dimension (fc + 1); 

• the point Xk+i is the minimizer of f on the affine subspace xq + span [di]Q. 

When the residue vector is zero, the optimum has been attained, showing that CG terminates in finite 
time. Apart from the finite time convergence property, the following formula indicates net improvement 
with respect to Steepest Descent: 

\\xk-x*\\A<2(^^j^^ \\xo-x*\\a (14) 

which represents substantial improvement with respect to (jlip . 

For a proof of this theorem as well as further details on the contents of this section, see [H [5] . 



3 Statement and Analysis of the Cooperative Conjugate Gra- 
dient Method 

3.1 The two-agent case 

In this subsection, in order to aid comprehension and ease notation, the case of two agents (the case 
p = 2) is considered: their estimates at step fc are written as Xk,x'i, respectively, the residues as rk,r'f., 
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and the two descent directions as , d'j, . The gradients at each one of the current estimates are given as 

(r, r',)^A{xk x'^)-Vb, (15) 

with 1 = . As for CG method, we distinguish two steps. 

• Descent step. Given the current residues rfc,r^ and two descent directions rffcjdj,, this step deter- 
mines the upgraded value of the estimates Xk+i, x'l,^^ and therefore of the residues r^+i, rj,_|_]^. 
One allows the use of the two descent directions dk, d'^, thus looking for updates of the form 

{xk+i x'fc+i) = {xk 4) + {dk 4) al . (16) 

The matrix at G M^^^ has to be chosen. In the same spirit as for CG, this choice is made in such a 
way as to minimize f{xk+i) and /(x^^j). This yields in fact two independent minimization problems. 
Denoting 

aj ^ (aji aja) , j = l,2, (17) 
the two optimality conditions are given by 

0=(d, d',yA{xk + {dk d',)al)~{dk d',yb={dk d',yA{x', + {dk d',)al)-{d, d'.Yb. 

(18) 

This shows that the minimum is uniquely defined, and attained when 



(^;)=-(^fc r'.Yid, d',){{d, d',yA{dk d',))-K (19) 



Notice that the two descent directions dk,d'f^ have to be linearly independent for the matrix in to 
be invertible. Similarly to CG algorithm, we have the following four useful properties 

rk+i,r'k+i -L du,d'k ■ (20) 

• Orthogonalization step. The second step consists, given the residues rfc+i,r^^]^ and the current 
descent directions dk,d'^, in determining the next descent directions The latter should be 

A-orthogonal to all the previous descent directions. In fact, it will be sufficient to ensure A-orthogonality 
to dfe, d'j,, as for CG. One takes 

(dfc+i 4+0 = (^fc+i ^'k+i) + {dk d',)l3l. (21) 
The matrix (3k G R^^^ is chosen to ensure the four conditions 

This also leads to two independent problems for the two vectors dk+i, i^fc+i: writing now 

= J = 1,2, (22) 

the previous orthogonality conditions can be written as: 

0=(rfe + (dfc d',)l3iyA{dk <)-(r^. + (4 d'^) PlY A{dk 4-) (23) 
which yields the unique solution 

I3k=(^f^^-irk r',yA{dk d',) {{d^ d',yA{dk 4))"' 



(24) 
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• Summary of cCG in the case p = 2. Putting together the previous findings, we summarize the cCG 



algorithm in the case p = 2 as 

(rfe r',)^A{xk x',)~Vb (25a) 

{xk+i x'f^+i) = {xk x'^) + {rk al (25b) 

ak^~{rk r'.Yidk d'^) {{dk diyA{dk d',))-' (25c) 

{dk+i = {rk+i r^+i) + {dk 4) 1^1 (25d) 

Pk = -{rk+i <)((4 d',yA{dk <))-! (25e) 



3.2 Cooperative CG algorithm: the general case 

We now provide generalization to the case of p > 2 agents. The extension is indeed straightforward from 
(|25p . The matrices whose j-th column represents respectively the solution estimate, the residue and the 
descent direction of agent j, j = 1, . . . ,p, for iteration k are denoted Xk € W^'p, Rk € Dk £ R^'^p. 

In other words, Xk, Rk, Dk stand for the matrices written down (xfe xj,) , (jk , (c?/j d'j,) in Section 

The algorithm cCG in full generality is given as the list of instructions in Algorithm [21 Algorithm cCG 
is a generalization of CG , which is the case p ~ 1. In all algorithms in this paper, comments appear to 
the right of the symbol >. 



Algorithm 2 cooperative Conjugate Gradient (cCG) algorithm 



1 


Choose Xo e E"^P 








2 


Ro := AXo ~ lib 








3 


Do '■= Rq 


t> Generically, rank 


Do 


= p 


4 


fc := 








5 


while Dk is full rank do 








6 


ak -RlDkiDlADk)-^ 


> ak 


e s 


jpxp 


7 


ATfc+i := Xk + Dkol 


> Xk+i 


e IE 


jpxp 


8 


Rk+i AXk+i — Ip^fe 


> Rk+i 


e IE 


jpxp 


9 


Pk ■■= -Rl+^ADk{DlADky^ 


> Ih 


e IB 


5pXp 


10 


Dk+i ^fe+i + Dk(3l 


t> Dk+i 


e IE 


5pXp 


11 


k ^ k + l 








12 


end while 









Theorem 2 (Properties of cCG). As long as the matrix Dk is full rank (that is rank Dk ~ p) 

• the matrices {Ri}o '^'"^ mutually orthogonal, the matrices {Di]^ o-f^ mutually A-orthogonal, and 
the subspaces span [i?,;]Q,span [Dj]o and span [A*i?o]o c'^fi equal and have dimension (k + l)p; 

• for any vector Cj of the canonical basis ofW, the vector Xk+iCj G K" (which constitutes the j-th 
column of Xk+i) is the minimizer of f on the affine subspace XoCj + span [I?i]Q. 

Theorem [2] indicates that, as long as the residue vector Rk is full rank, the algorithm cCG behaves 
essentially as does CG, providing p different estimates at iteration fc, each of them being optimal in an 
afHne subspace constructed from one of the p initial conditions and the common vector space obtained 
from the columns of the direction matrices Di, i = 0, . . . , k — 1. This vector space, span [-Di]§, has 
dimension (fc + l)p: each iteration involves the cancellation of p directions. Notice that different columns 
of the matrices Dk arc not necessarily ^-orthogonal (in other words, DJ^ADk is not necessarily diagonal), 
but, when Rk is full rank, they constitute a set of p independent vectors. The statement as well as the 
proof of this theorem are inspired by the corresponding ones for the conventional CG algorithm given in 
[HI p. 270fFl and [3 p. 390-391]. 

Proof of Theorem [H 

• We first show that for any k, 

span ^ span ■ (26) 
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• We show the first point by induction. Clearly, Dj _La Dk for any j < k, and span [i?i]o = span [Di]Q = 
span [A*i?o]Q when k = 0, while nothing has to be verified for the orthogonality conditions. Assume 
that, for some fc, they are verified for any i < k, let us show them for A: + 1, assuming that Rk+i is full 
rank. 

From lines [ZHl] of Algorithm [2j Rk+i = AXk+i — Ipb = Rk + ADka\- By induction, the columns of 
both the matrices Rk and AD^ are located in span [A'i?o]o^^- Thus, 

Rk+i e span [A'i?o]S+' 

and consequently 

span [i?,]o+^ C span [A'i?o]o+^ . 
On the other hand, for any vector Cj of the canonical basis of W ^ 



Rk+iej ^ span [A]S (27) 

because each residue is orthogonal to the previous descent directions, so that Rk+iej e span [Di]Q for 
some Cj would imply Rk+ie = 0, which contradicts the assumption of full rankness of Rk+i- Indeed, for 
the same reason, one can also show that, for any w e M^* \ {0}, Rk+iv ^ span [Dijg. Using again the fact 
that span = p, one sees that 

span D span [A'Ro]^^^^ 



and indeed 

span [i?j]o+^ = span [A*i?o]o 
One shows similarly from lines (|9))- ((TU)) of Algorithm [5] that 



span [d4+^ C span [^'i?o]S+' , 

and the equality is obtained using the same rank argument. 

The dimension of these sets is (k + 2)p, as they contain the p independent vectors in span [D^+i] 
orthogonal to span 

From line [10] of Algorithm [2j one gets 

DjADk+i = DJA [Rk+i + DkPl) (28) 

For i < k, the first term is zero because ADi <E span [i3j]o^^ and the gradients constituting the columns 
of Rk+i arc orthogonal to any vector in span [ZJ^Jg'^^; while the second term is also zero due to the 
induction hypothesis. For i = k, the right-hand side of (|28p is zero because (3k is precisely chosen to 
ensure this property. Thus the {i'ijo^^ ^.re mutually v4-orthogonal. 

Orthogonality of Rk+i follows from lines [S] and H] of Algorithm [21 The induction hypothesis has been 
proved for fc + 1 concluding the proof of the first part of Theorem [2] 
• [Optimality] . Arguing as in p. 390-391], we can write 

k 

Xfe+ie, =Xoej+^A7,' (29) 

i=0 

for some ji e K^^^". Optimality implies that 

Dj{AXk+ie, - 5) = 0, (30) 
Substituting ([^^ in ([50)) and rewriting in terms of the matrices Ri and yields 

7, = -e]RlD,{D]ADi)-^ (31) 

On the other hand, 

k k 

Vf{Xk+iej) = A(Xoe, + D.nJ) - b = Roe, + ^ ADaJ (32) 

i=0 i=0 
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Thus 

Dl+,Vf{Xk+ie,) = Dl^.Roej (33) 

Substituting in ([5T|) . we get: 

7fc+i = -yf{Xk+ie,yiDl+,ADk+i)-' = e]ak+i (34) 

□ 

A natural question is now to study the cases where at some point of the execution of the algorithm 
cCG one gets rank < p. In the best case, this occurs because one of the columns of Rk is null, say the 
j-th one, meaning that V f{Xkej) = 0, and thus that the A:-th estimate of the j-th agent is equal to the 
optimum x* = A~^b. But, of course, rank Rk can be smaller than p without any column of Rk being 
null. 

First of all, the following result ensures that this rank degeneracy is, in general, avoided during 
algorithm execution. 

Theorem 3 (Genericity of the full rank condition of cCG residues matrix). For an open dense set of 
initial conditions Xq in R"^p, one has during any cCG run 

n 

V < fc < fc* = I -J , rank Rk ^ p ■ (35) 
P 

Moreover 

dimspan ^pI-\ ■ (36) 

Otherwise said: generically, algorithm cCG can be run during k* steps, and 

• any of the columns of Xk* minimizes f on an affine subspace o/R" of codimension p\J^\; 

• application of CQ departing from any of the columns of Xk* yields convergence in at mostn^p\J^\ < 
p — 1 steps. 

The second part of Theorem [3] has to interpreted as follows. When the size n of the matrix A is a 
multiple of the number p of agents, then cCG generically ends up in ^ steps. When this is not the case, 
the estimates Xk* obtained for k = k* minimize the function / on affine subspace whose underlying 
vector subspace is span (see Theorem [2]). The interest of ([55)1 is to show that this subspace is 

quite large: its codimension is p[^J , which is at most equal to p — 1. 

Proof of Theorem\^ The main point consists in showing that generically, k* iterations of the algorithm 
cCG can be conducted without occurrence of the rank deficiency condition. As a matter of fact, the other 
results of the statement are direct consequences of this fact. 

To show the latter, use is made of Theorem O From the properties stated therein, one sees that, for 
any < fc < fc* , the rank of Xk is deficient if and only if a linear combination of the p column vectors of 
Xq pertains to the fcp-dimensional subspace span [Di\Q~^. In a vector space of dimension n > kp, this 
occurs only in the complement of an open dense set. 

Now, if the column vectors of Xk are linearly independent, the same is true for Rk, see line [8] of 
Algorithm [2l and as well for Dk , see line [TOl This completes the proof of Theorem [3l 

□ 

We now study what can be done in case of rank degeneracy. When pk = rank Dk is such that 
< Pk < p, this means that trajectories initially independent have come to a point where the estimates 
in Xk will converge along directions which are now linearly dependent. The natural solution is then to 
choose any full-rank subset of trajectories. We thus propose the modified algorithm [31 

Theorem 4 (Convergence of mcCG algorithm). For any nonzero initial condition Xq, algorithm mcCG 
ends up in k** iterations for some k** < n. Moreover 

• the sequence (j'fc)o<fe<fc** is nonincreasing; 

• any of the columns of Xk-'* minimizes f on an affine subspace ofM" of codimension pk" L~^J/ 

• application of CG departing from any of the columns of Xk** yields convergence in at most n — 
Pk" L^fzrJ ^ Pk" - 1 steps. 

The proof is straightforward and omitted for brevity. 
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Algorithm 3 modified cooperative Conjugate Gradient (mcCG) algorithm 



i 


Choose Xq e M""^P 






2 


i?o 


:= AXo - lib 
'■= Ro 






3 








4 


Pa 


= rank Dq 


> Po is the initial value of the rank 


5 
6 


k :-- 
if 


= 

Po = p then 






7 




go to [m 






8 


else 






9 




choose Jg{1,...,p} such that rank Ro\j^j — 






10 










11 




Ro *i— RoljeJ 




> i?o e R^'o^Po is full rank 


12 




Do ^ Doljej 




l> Dfe e RPoxpo is full j-ank 


13 


end if 






14 


while pi; > do 






15 




au -RlDk{DlADkr^ 




> ak& RP^^P" 


16 




Xk+i '■= Xk + Dkol 




> Xfe+i e RPfcXP*" 


17 




Rk+i ■= AXk+i — l^i^b 




t> e Rp^-xp*" 


18 




h ■■= -Rl+^ADk{DlADkr^ 




e Rp^-xp* 


19 




Dk+i -Rfe+i + Dk(3l 




> Dk+i e Rp^-xp*. 


20 




Pk+i ■■= rank Dk+i 






21 




if pk+i = Pk then 


If 


Pfc+i = Pk , cCG goes on normally 


22 




go to [29] 






23 




else > If pk+ 


I <Pk, 


Pk ^ Pk+i agents are suppressed 


24 




choose J (z {1, . ■ . ,Pk} such that rank Rk+ 




Pk+i 


25 




Xk+i ^ Xk+i\jeJ 




> Xk+i e RP''+ixP'=+i 


26 




Rk+l ^ Rk+l\jeJ 




> Rk+i e RPfe+ixpfc+i is full rank 


27 




Dk ^ -Dfc jgj 




> e RP^+ixpfc+i is full rank 


28 




end if 






29 




k^k + l 






30 


end while 
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4 Computational complexity 



This section is concerned with the evaluation of the gain in computation time of the numerical solution 
of equation ([B]), when using cCG algorithm with p agent, i.e., the gain which is expected is due to the 
parallelism induced by a multithread implementation. We evaluate this issue here assuming computations 
in exact arithmetic. Moreover, thanks to Theorem [3l we adopt the generic assumption that the rank 
of the residue matrices remains constant (and full), and that the computations are then carried out for 
[^J iterations. Disregarding as marginal the supplementary CG steps (see the statement of Theorem [3]), 
we thus consider it to be realistic to quantify the worst case complexity by evaluating the numbers of 
multiplications involved by ^ iterations of cCG. Recall that the case p — I corresponds to the usual CG 
algorithm. 

We propose the multithread implementation detailed in Table 1. 



Table 1: Number of scalar multiplications in k-th iteration 



Operation carried out by 
i-th processor 


Composite result 


Dimension 
of the result 


Number of scalar 
multiplications carried out by the 
ith processor 






n X p 




dL.^ad, 


DlADk 


p X p 


7ip 






p X p 


7ip 




c^k - RlDkiDlADf,)-^ 


p X p 


6 


-Rfc + l,i — Rk,i — ^-OfcCtfc i 


Rk+i - Rk - AD^al 


71 X p 


7ip 


^k + l,i — ^k,i — ^kO^k.i 


Xk+i — Xk - Dkal 


71 X p 


7ip 


K+i.,ADk 


Bi+,ADk 


p X p 


7ip 


Pk., s.t. l3k.i{DlADk) = 


(^k - -Bl+,AD^{DlADk)-' 


p X p 


6 


Dk+l,^ = Rk + l.i + DkPl, 


Dk+i - B-k+i -^Dk^l 


71 X p 


np 


Total number of scalar multiplications per processor and per iteration 


«2 +6np+ "'P+if 



In Table 1, the first column indicates the task carried out at each stage by every processor, and the 
last column the corresponding number of multiplications carried out by a processor. The double lines, 
separating the first row from the second and the second from the third, indicate the necessity of a phase 
of information exchange: every processor at that stage needs to know results from other processors, also 
called a synchronization barrier in computing terminology. The second column, labelled composite result, 
contains the information that is available by pooling the partial results from each processor and the third 
column gives the dimension of this composite result. The fourth and final column contains the number 
of multiplications carried out by the ith processor. The number of scalar multiplications is 

needed to realize Gaussian elimination realized through LU factorization [131 p. 15]. 

As indicated by the last line of Table 1, a total of + 6np+ multiplications per processor 

is needed to complete an iteration. Since, generically speaking, the algorithm ends in at most - iterations 
(see Theorem[3]), an estimate of the worst-case multithread execution time is given by the following result. 

Theorem 5 (Worst-case multithread execution time in exact arithmetic). Generically, multithread ex- 
ecution of cCG using p agents for a linear system (0) of size n requires 

^ ^ fi 2 ^ (p+l)(2p+l) 

N(p) = h on + n (37) 

P 3 

multiplications performed synchronously in parallel by each processor. 

This result has straightforward consequences. 

Corollary 6 (Multithread gain). For problems of size n at least equal to 5, it is always beneficial to use 
p < n processors rather than a single one. In other words, when n > 5, 

V 1 < p < n, A^(l) > N{p) . (38) 
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Proof. One has 



N{1) - N{p) = + -n- ( 271^ + -nA -{n^ - 6n'^ + 5n) = -n{n - l){n - 5) . 
3 \ 3 / 3 3 



Moreover, 



= n \ n 



dp \3 



which is negative for n > 2, while 

dp \y3 

The convexity of N then yields the conclusion that N{p) < N{1) for any 1 < p < n. □ 

Corollary 7 (Optimal multithread gain). For any size n of the problem, there exists a unique optimal 
number p* of processors minimizing N(j)). Moreover, when n — >■ +oo, 



-) 713 (39a) 
N{p*) « ( (^) \l(^y ] n^+h « 1.651^2+1 (39b) 



Proof. One has 



Z 3 V4 



dNip) 4 
- -— + -np + n 



dp 

There exists a unique p* canceling this expression. For this value, one has = p^(|p+ 1), which yields 
the asymptotic behavior given in p9ap . The value in p9bl) is directly deduced. □ 

The conclusion of Corollary [7] is quite important. It shows that solution of Ax = 6 is possible by the 
method proposed here with a cost of 0(n^+3 ) multiplications. This is to be compared with the classical 
results [H]. 



5 Numerical experiments with discussion of multithread imple- 
mentation 

This section reports on a suite of numerical experiments carried out on a set of random symmetric 
matrices of dimensions varying from 1000 to 25000, the latter being the largest dimension that could 
be accommodated in the fast access RAM memory of the multicore processor. The random symmetric 
matrices were generated by choosing random diagonal matrices A, with positive diagonal entries uni- 
formly distributed between 1 and a prespecified condition number, which were then pre-multiplied (resp. 
post-multiplied) by a random orthogonal matrix U (resp. its transpose U^). The random orthogonal 
matrices U were generated using a C translation of Shilon's MATLAB code [12] , which produces a matrix 
distribution uniform over the manifold of orthogonal matrices with respect to the induced R" Lebesgue 
measure. The right hand sides and initial conditions were also randomly generated, with all entries 
uniformly distributed on the interval [—10, 10]. In this preliminary work, the matrices used were dense 
and the use of preconditioners was not investigated. 

In order to evaluate the performance of the algorithm proposed in this paper, a program was written 
in language C. The compiler used was the GNU Compiler Collection (GCC), running under Linux Ubuntu 
10.0.4. For the Linear Algebra calculations, we used the Linear Algebra Package (LAPACK) and the 
Basic Linear Algebra Subprograms (BLAS). Finally, to parallelize the program, we used the Open Multi 
Processing (OMP) API. The processor used was an Intel Core2Quad CPU Q8200 running at 2.33 MHz 
with four cores. 

The pseudo-code in Algorithm 2] gives details of the implementation for three (p = 3) agents. 
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Algorithm 4 Implementation of cooperative Conjugate Gradient (cCG) algorithm 



1 


Choose Xo,lo,^o e 


> All initialized randomly with numbers between —10 and 10 


2 




= A ■ Xq — b 






3 




^A-yo 


- b 






4 


ro.z 


= A-zo 


- b 






5 




= ro,x 








6 


do,y 


= ro,y 








7 


do.z 


= ro,z 








8 


k := 











9 


minres :— 'min{norm{ro^x) , 


norm(rQ 


y),norm{ro^z)) 


10 


while minres > tolerance 


do 




11 










t> Compute matrix-vector products A ■ dk^i 


12 


agent 1 


compute A 


dk,x 




13 


agent 2 


compute A 


dk^y 




14 


agent 3 


compute A 


dk,z 




15 


Barrier 


> Synchroni 


zes all 


3 agents, before proceeding to the next computations 


16 










t> Compute mij 


17 


agent 1 


"^11 := d^^ 


A ■ dk,x > 


mi2 rffc J. • A ■ dk,y 


18 


agent 2 


^13 — dl,x 


A ■ dk,z; 


rn22 dj^y ■ A ■ dk,y 


19 


agent 3 


m23 := dly 


A ■ dk,z ; 


:= dl .^ ■ A ■ dk,z 


20 


Barrier 


t> Synchroni 


zes all 


3 agents, before proceeding to the next computations 



21 
22 
23 
24 
25 
26 
27 
28 
29 
30 
31 
32 
33 
34 
35 
36 
37 
38 
39 
40 
41 
42 
43 
44 
45 
46 
47 
48 
49 
50 
51 
52 
53 
54 
55 
56 



Initialize M :— {mij} 



> Symmetric matrix needed to compute alpha, 



agent 1 
agent 2 
agent 3 

agent 1 
agent 2 
agent 3 

agent 1 
agent 2 
agent 3 

agent 1 
agent 2 
agent 3 

agent 1 
agent 2 
agent 3 

agent 1 
agent 2 
agent 3 

agent 1 
agent 2 
agent 3 

agent 1 
agent 2 
agent 3 

minres : 
k ^ k + 
end while 



ni 
n2 



= \r 



T 
k,x 
T 
k,y 
T 
k.z 



dk,X ; '^k 



dk,: 

dk,. 



Solve M ■ Oil 
Solve M ■ a2 
Solve M ■ a3 



X 
T 

' k,y 
T 

' '''k,z ' 



ni 
«3 



dk,y\'rj,y 
dk.y'irf. ■ 



> Right-hand sides needed to compute alpha 

■ dk,z] 
dk.z] 
dk^z] 

> Computation of alpha 



Xk ^ Xk 

Vk ^ Vk - 



- ai,i ■ dk,x 
0:2,1 ■ dk,x 



- ai,2 • dk,, 

"2,2 • dk,y 



- ai,3 
a2,3 



• dk,. 
dk,. 



t> Update estimates of each agent 



Zk ^ Zk + a3,i • dk,x + ^3,2 • dk,y + a3,3 • dk,z 



t> Update residues of each agent 



rk,x 

fk,y 
rk,z 

ni 
n2 
"3 



A ■ Xk 
A-Vk 
A - Zk - 



[^k,x 
i^k,y 

[rlz ■ 



A 
A- 



- b 
b 
b 

dk,: 
dk,: 



> Right-hand sides needed to compute beta 



T 



k,x 

y 



■A 

■A' 



Solve M ■ Pi 
Solve M ■ 132 
Solve M ■ f33 



dk,y', Tk^x 
dk,y] 'Tf. y 

dk,x]rj,^^ ■ A ■ dk,y;rf. ., 

= ni 
= n2 

= n-i 



■A 

■A- 
A- 



dk,z] 
dk,z] 
dk,z] 



dk,x ^ rk,x + /3i,i • dk,x + /3i,2 
dk,y ^ rk,y + /?2a • dk,x + 1^2,2 
dk,z ^ rk,z + k,! ■ dk,x + h,2 ■ 

normr^ = norm{rk,x) 
normry — norm(rk,y) 
normr^ = norm{rk,z) 
min{normr^ , normr^ , norm^^ ) 



■ dk,y 
dk,y 
dk,y 



' /3l,3 
' /32,3 ' 
- /33,3 • 



dk,z 
dk,z 
dk.z 



> Computation of beta 



> Update of directions 



O Calculate of residual norms 



1 
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5.1 Evaluating speedup 



The results of the Cooperative 3 agent cCG, in comparison with classic CG, with a tolerance of 10~"^, 
and matrices with different sizes, but all with the same condition number of 10^, are shown in Figure 1. 
Multiple tests were performed, using different randomly generated initial conditions (20 different initial 
conditions for the small matrices and 10 for the bigger ones). Figure 1 shows the mean values computed 
for these tests. 



Mean convergence time vs. matrix dimension for the CG and cCG aigorithims 



11000 
10000 
9000 
8000 
7000 
f 6000 
ii 5000 
4000 
3000 
2000 
1000 



-Convergence time for CG algorithm 
-Convergence time for cCG algorithm 




1 1.5 

Matrix dimension 



Figure 1: Mean time to convergence for random test matrices of dimensions varying from 1000 to 25000, 
for 3 agent cCG and standard CG algorithms. 



The iteration speedup of cCG in comparison with CG is defined as the number of iterations that CG took 
to converge divided by the number of iterations cCG took to converge and the experimental results are 
shown in Figure [5J which also shows the classical speed-up, which is the ratio of the time to convergence, 
i.e., the time taken to run the main loop until convergence, for CG versus cCG. 



Speedups (iteration, time) for cCG versus CG as a function of matrix dimension 









— Iteration Speedup 
— Speed-Up 


T.J 






number of CG iterations / number of cCG iterations / 

























Matrix dimension 



Figure 2: Average speedups of Cooperative 3 agent cCG over classic CG for random test matrices of 
dimensions varying from 1000 to 25000. 

The speedups seem to be roughly equal up to a certain size of matrix {n = 16000); however, above 
this dimension, there is an increasing trend for both speedups. 

The numerical results obtained show that cCG, using 3 agents, leads to an improvement in compar- 
ison with the usual CG algorithm. The average iteration speedup and the classical speedup of cCG are 
respectively, 1.62 and 1.94, indicating that cCG converges almost twice as fast as CG for dense matrices 
with reasonably well-separated eigenvalues. 

5.2 Verifying the complexity estimates 

Figure 3 shows the mean time spent per iteration in seconds (points plotted as squares), versus matrix 
dimension, as well as the parabola fitted to this data, using least squares. Using the result from the last 
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row of table 1 and multiplying it by the mean time per scalar multiplication, we obtain the parabola 
(dash-dotted line in Figure 3) expected in theory. In order to estimate the time per scalar multiplication, 
we divided the experimentally obtained mean total time spent on each iteration and divided it by the 
number of scalar multiplications performed in each iteration. This was done for each matrix dimension. 
Since the same multicore processor is being used for all experiments, each of these divisions should 
generate the same value of time taken to carry out each scalar multiplication, regardless of matrix 
dimension. It was observed that these divisions produced a data set which has a mean value of 8.10 
nanoseconds per scalar multiplication, with a standard deviation of 1.01 nanoseconds, showing that the 
estimate is reasonable. 



Mean time per iteration versus matrix dimension 




IVlatrix dimension (n) in units of 10^ 



Figure 3: Mean time per iteration versus problem dimension 



Now, from equation (j37p . substituting p — 3, neglecting small order terms, and multiplying it by the 
estimated mean time per scalar multiplication (8.10 nanoseconds), the number of matrix multiplications 
per iteration, N(p),p = 3, is a cubic polynomial in n. Thus, the logarithm of the dimension (n) of the 
problem versus the logarithm of time needed to convergence is expected to be a straight line of slope 3. 
Figure 4 shows this straight line, fitted to the data (squares) by least squares. 



Time to convergence vs. matrix dimension (N) 





















□ ExDerimental data 












Least squares 


fit: y-2.542-x- 17.658 












































































































1 




1 - 



Log{Matrix dimension (n)) 



Figure 4: Log-log plot of mean time to convergence versus problem dimension 



Its slope (2.542) is fairly close to 3, and data seems to follow a linear trend. The deviation of the 
slope from the ideal value has several probable causes, the first one being that the exact exponent of 3 
is a result of a worst case analysis of CG in exact arithmetic. It is known that CG usually converges, to a 
reasonable tolerance, in much less than n iterations, where n is the matrix dimension 0. 

Similarly, the logarithm of the number of iterations needed to convergence versus the logarithm of the 
dimension of the problem should also follow a linear trend. Since the number of iterations is expected 
to be n/3, the slope of this line should be 1. This log-log plot is shown in figure[SJ in which the straight 
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line was fitted by least squares to the original data (red squares). The slope (0.501) of the fitted line is 
smaller than 1, but is seen to fit the data well (small residuals). The fact that both slopes are smaller than 
their expected values indicates that the cCG algorithm is converging faster than the worst case estimate. 
Another reason is that a fairly coarse tolerance of lO^'^ is used, and experiments reported show that 
decreasing the tolerance favors the cCG algorithm even more. Specifically, for a randomly generated 
matrix of dimension 8000 and condition number 10^, Table 2 shows the mean number of iterations and 
time to convergence, calculated for 10 different initial conditions, for the CGand cCG algorithms, as the 
tolerance is varied from 10"'^ to 10~^ 



Iterations needed to convergence (with tolerance 1 ^) versus matrix dimension 









1 "^„ai 






□ Experimental data points 


















Fitted curve: 


y-0.498-x + 1.376 
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Matrix dimension (n) in units of 10"^ 



Figure 5: Iterations needed to convergence versus problem dimension 



Table 2: Mean number of iterations and mean time to convergence for the CG and the cCG algorithms, 
as a function of tolerance used in the stopping criterion. 



Tolerance 


CG 


cCG 


Time (s) 


Iterations 


Time (s) 


Iterations 


10-^ 


148.20 


372.20 


121.80 


299.70 


10-'' 


165.20 


414.00 


125.70 


319.40 


10-^ 


177.40 


444.90 


132.00 


335.00 


io-« 


192.70 


476.80 


134.00 


352.70 


10-7 


206.50 


510.70 


135.20 


336.20 


io-« 


235.10 


538.40 


137.10 


373.50 


io-*J 


275.00 


559.70 


141.90 


381.20 



The data used to generate all the graphs in the figures above is shown in tables 3 and 4. 



6 Concluding Remarks 

This paper proposed a new cooperative conjugate gradient (cCG) method for linear systems with symmet- 
ric positive definite coefficient matrices. This cCG method permits efficient implementation on a multicore 
computer and experimental results bear out the main theoretical properties, namely, that speedups close 
to the theoretical value of p, when a p-core computer is used, are possible, when the matrix dimension is 
suitably large. The experimental results of the current study were limited to dense randomly generated 
matrices and only 3 cores of a 4 core computer with a relatively small on-chip shared memory were 
used. Future work will include the study of the method on matrices that come from real applications 
and are typically sparse and sometimes ill-conditioned (which will necessitate the use of preconditioners) 
on larger multi-core machines. The use of larger machines should also permit exploration of the notable 
theoretical result (corollary [7]) that, in the asymptotic limit, as n becomes large, implying that p also 
increases according to (j39ap , solution of Ax = 6 is possible by the method proposed here with a cost of 
0(n^''"3) multiplications. 
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Matrix Dimension 


Number of iterations 


Time (s) 


CG 


CCG 


CG 


CCG 


1000 


245.50 


137.05 


1.80 


1.05 


1500 


230.15 


146.65 


3.85 


3.05 


2000 


303.35 


182.60 


8.05 


5.45 


2500 


231.60 


158.20 


11.05 


7.25 


3000 


347.00 


221.35 


20.65 


15.30 


3500 


402.15 


240.50 


32.15 


23.35 


4000 


399.85 


257.45 


40.30 


31.85 


4500 


391.85 


237.10 


51.95 


40.70 


5000 


481.60 


270.05 


77.00 


57.05 


6000 


351.90 


261.70 


81.10 


70.60 


7000 


390.90 


293.30 


121.70 


95.00 


8000 


372.20 


299.70 


148.20 


121.80 


9000 


659.90 


386.50 


343.50 


191.50 


10000 


894.70 


456.20 


532.60 


480.80 


11000 


667.00 


413.20 


614.40 


413.20 


12000 


780.00 


448.20 


673.00 


537.40 


13000 


582.80 


386.50 


753.90 


548.60 


14000 


853.20 


477.30 


1022.70 


769.00 


15000 


852.40 


460.60 


1543.00 


841.70 


16000 


813.60 


514.40 


2070.00 


922.70 


17000 


842.20 


548.90 


3921.60 


1277.20 


18000 


802.50 


485.70 


4204.70 


1325.20 


19000 


884.50 


518.00 


5171.50 


1836.70 


20000 


882.10 


501.90 


5703.70 


2072.60 


21000 


1064.30 


638.00 


7614.70 


2526.40 


22000 


7671.80 


2537.60 


985.10 


607.60 


23000 


7045.60 


2516.40 


826.30 


597.70 


24000 


9969.20 


3065.20 


1040.90 


617.40 


25000 


1114.70 


617.20 


11237.40 


3067.50 



Tabic 3: Average results for multiple test matrices of dimensions varying from 1000 to 25000, for Coop- 
erative 3 agent cCGand for classic CG. 
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Matrix Dimension 


Iteration Ratio 


Speed-Up 


1000 


1.79 


1.71 


1500 


1.40 


1.26 


2000 


1.66 


1.48 


2500 


1.46 


1.52 


3000 
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1.67 
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